1 Introduction

We are constantly surrounded by information in today’s digital world. But not all of it is accurate. We rely on data to gauge whether something is true or false. In addition, more and more data is being collected and stored, and the information it contains needs to be unlocked.

We rarely see this data in its raw form. You can imagine how rows and rows of numbers could be very confusing to interpret. Because of this we usually use a method called data visualisation to present patterns and trends in data more easily.

Data visualisation is the translation of data into visual representations, like charts and graphs, to communicate the information contained in the data.

Whilst this method simplifies the process of understanding data it can also bend the truth and misrepresent information if care is not exercised.


An effective graph is a visualisation of data that unlocks and conveys information or a story from data. This may help verify or debunk beliefs or assumptions, answer questions about a topic/issue or guide further research.

The first element needed in data visualisation is … data!

  • Make sure that you understand its structure and that the data is “wrangled and clean”. In most situations, the data will arrive in an unsuitable format and/or contain anomalies such as missing values and outliers. Quite likely, such problems will only come to light whilst visualising the data. Organise the data in a form suitable for the display you want to create.
  • Do not “cherry pick” your data to suit some belief.
  • Be honest and make sure that you understand how representative the data is.

When you start plotting the data, don’t expect your first or even third attempt to produce the final display. You will arrive at your preferred display after a few iterations.

What elements should a good plot contain?

  1. Content: The plot tells a story easily. Given that you can plot the same data in many different ways, choose the display that is most informative for answering the question at hand. Avoid displays that are overly complicated, however ingenious. Is the plot conveying the information you want? Have you given enough context for others to be able to interpret the charts?
  2. Minimalism: Does the type of plot you chose contain superfluous elements? If so, remove them so that the story is enhanced.

I will illustrate these points in what follows.

2 Case Study: Multiple graphs in one plot. Massachusetts Bay Transportation Authority (MBTA) data

Let us consider monthly transportation data in Boston, USA, consisting of monthly averages of weekday number of passengers (in thousands) by mode of transportation from 2007 until 2011.

The data is already in tidy format.

Let us visualise the data.

2.1 Box plots

Box plots give a quick view of the distribution of data, and that view is resistant to outliers. Note that box plots should only be used if the distribution is unimodal.

Box plots show the median and the first and third quartiles of a data set, as well as whiskers, which extend to the most extreme observations lying within 1.5 times the interquartile range of the first and third quartiles. Data beyond the whiskers are considered to be outliers and deserve some special consideration.
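To make the whisker rule concrete, here is a minimal sketch of the numbers a box plot draws. It assumes the common 1.5 × IQR convention and uses Python's standard `statistics` module. The function name `box_plot_summary` and the ridership figures are made up for illustration, and plotting libraries may compute quartiles slightly differently.

```python
import statistics

def box_plot_summary(values):
    """Return the numbers a box plot draws, using the 1.5 x IQR whisker rule."""
    data = sorted(values)
    # Quartiles; the "inclusive" method interpolates between observed values.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme observations inside the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }

# Hypothetical monthly ridership figures (thousands), not the MBTA data.
ridership = [12, 13, 13, 14, 14, 15, 15, 16, 16, 17, 4]
summary = box_plot_summary(ridership)
```

In this made-up series the value 4 falls below the lower fence, so it would be drawn as an individual point rather than being covered by the whisker.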


Box plots by mode of transportation show the distribution of average weekday number of passengers by mode of transportation. This visualisation helps us to understand the use of the different modes of transportation.



The plot is useful because it immediately conveys the message that Heavy Rail is the most used mode of transportation, followed by Bus, Light Rail and Commuter Rail. Boat, Private Bus, Trackless Trolley and RIDE (shared transportation for passengers who cannot use the other modes of transportation) are the least used modes of transportation.

For discussion: What is not quite right with this plot?

The scale of the number of passengers for the most and least used modes of transportation is too different and so it is not possible to make any inferences about the distribution of the number of passengers for the 4 least used modes of transportation.


Therefore, if the visualisation above is used, a separate visualisation of the least used modes of transportation is needed.


It wasn’t possible to view the features of the distributions of the number of passengers for the 4 least used modes of transportation in the first plot.

There are some unusually low observations for Trackless Trolley. Private Bus has a very skewed distribution.

For discussion: What would you change in these plots?

I propose now to apply the principle of minimalism to this plot.



I have removed the background colour, the vertical grid, the legend, the x-axis label, and added a more succinct title. As the background is white, I have also filled the boxes with colour rather than colour just the border of the box.

2.2 Time plots

Now let us visualise how the number of passengers has evolved over time for the different modes of transportation in order to add more insight.



As before, it is impossible to observe any trends for the 4 least used modes of transportation. Also, the trends for the 4 most used modes of transportation are flattened as the y-axis starts at zero.

We view them separately.



We see that there are distinctive features for these modes of transportation:

  • Trackless trolley has some unusually low observations during 6 consecutive months in 2010. This is possibly due to the service not being fully available due to construction, etc.
  • Boat is seasonal, with highest numbers travelling by boat in the summer.
  • Private Bus has seen a decline in 2009, possibly due to lack of funding. Trackless Trolley also has a decline since 2009.
  • The trend for RIDE is upwards, possibly because more funding became available.

For discussion: How would you improve the plots above?

For example, remove the x-axis name “date” and the legend name “mode” as these are obvious.

I personally like the background grid because it makes it easier to detect changes in trends and when they happened. Given the option of a hover tool which displays information of the data, perhaps the vertical grid can be erased.

Let’s try a white background with only a horizontal grid, and remove the legend title (as it’s obvious in this case).



This plot is less busy and therefore more satisfactory. Once you arrive at a display that conveys the information you want, ask yourself: are any elements superfluous? If I remove an element, such as a grid line or a background colour, is it still easy to convey the story about the data?

The time plots for the 4 most used modes of transportation are below. Note how much better the trends can be viewed compared to the previous plot where the y-axis had a zero origin.

2.3 Bar plots

We can use bar plots to display the data.

Let us consider just the data for Boat and Trackless Trolley.

Now we will produce a stacked bar plot. Note that the coordinates are flipped as I think that in this case it is easier to see the data like this.



  • It’s obvious that the columns correspond to months so I don’t label the y-axis.
  • Note that I have ordered months from January at the top to December at the bottom. This is because, at least in this part of the world, we tend to read and order from top to bottom.
  • Note that we can get rid of all the tick marks of the x-axis since the bars are annotated with the numbers. I prefer annotations to axis tick marks and labels.
  • Instead of an x-axis label I put a title to the plot.



Perhaps in this situation it is best to display the data as a grouped bar chart rather than a stacked bar chart, because the grouped bar chart lets us see at a glance that more passengers travel by trackless trolley than by boat, and that boat travel is seasonal.



  • I think it’s best NOT to flip this bar chart.

  • Note that I got rid of the grid entirely. This is because the bars are annotated.

  • Now months should be ordered from left to right.



  • Make sure plots convey a message and you can tell a story by allowing data to display fully. If needed, produce more than one plot.
  • The y-axis doesn’t necessarily have to include 0. Whether it does or not depends on the scale of the displayed data.
  • If the range of the y-axis is not in line with the range of the data, the plot will distort or hide possibly key features of the data.
  • Less is more: remove unnecessary elements from a graph without compromising the clarity of the information displayed.
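The point about the y-axis range can be quantified with a small sketch. The helper names `axis_fill_fraction` and `padded_limits` and the series are assumptions made for illustration. The idea is simply to compare how much of the axis height the data occupies under two choices of limits.

```python
def axis_fill_fraction(values, y_min, y_max):
    """Fraction of the axis height that the data's own range occupies."""
    return (max(values) - min(values)) / (y_max - y_min)

def padded_limits(values, padding=0.05):
    """Axis limits that hug the data, with a small margin on each side."""
    low, high = min(values), max(values)
    margin = (high - low) * padding
    return low - margin, high + margin

# Hypothetical series (thousands of passengers), not the MBTA figures.
series = [480, 495, 510, 505, 520]

# With a zero origin the variation is squeezed into a thin band ...
from_zero = axis_fill_fraction(series, 0, max(series))
# ... whereas limits that follow the data let the variation fill the plot.
tight = axis_fill_fraction(series, *padded_limits(series))
```

For this made-up series, the variation uses under a tenth of the axis when the origin is zero and about nine tenths when the limits follow the data. Which choice is appropriate still depends on the story: a zero origin is usually expected for bar charts, where bar length encodes the value.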

3 Case study: Association of qualitative variables. Berkeley admissions.

We consider graduate admission figures for the autumn of 1973 at the University of California, Berkeley. The numbers, shown below, seem to imply that men applying for post-graduate studies were more likely than women to be admitted. It was argued that the difference was so large that it was unlikely to be due to chance. The data set is usually presented as follows:

Berkeley admissions data by gender and admission status

Gender   Applicants  Admitted
Men      8442        44%
Women    4321        35%

The data above has been aggregated across departments. However, it is known that men and women tend to apply to different departments (women have a preference for e.g. psychology or English studies, men have a preference for e.g. engineering studies).

When examining the individual departments, it appeared that six out of 85 departments were significantly biased against men, whereas only four were significantly biased against women. The data from the six largest departments are listed below.

Berkeley admissions data with admissions status by department

Department  Applied_men  Admitted_men  Applied_women  Admitted_women
A           825          62%           108            82%
B           560          63%           25             68%
C           325          37%           593            34%
D           417          33%           375            35%
E           191          28%           393            24%
F           373          6%            341            7%

3.1 Mosaic plots

We visualise this data set of categorical, non-ordinal data using a mosaic plot. Mosaic plots are useful for visualising proportions in more than 2 dimensions.



The height and length of each rectangle in the mosaic are proportional to the corresponding proportions in the margins. So, a very flat rectangle indicates, proportionally, very few applicants of the corresponding gender in a given department. A long rectangle in the admitted status indicates that, proportionally, it’s not so difficult to be accepted in the corresponding department. As we can see, there is no evidence for a discrimination case. In Departments A and B applicants are mainly male but in Department A, proportionally, more female than male applicants were admitted. Department F is very competitive and has a high rejection rate, which applies nearly equally to both female and male applicants.
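As a rough sketch of where the rectangle sizes come from, the applicant counts in the six-department table above can be turned into proportions. I assume here that one side of each rectangle encodes the department's share of all applicants and the other side the gender share within the department. The orientation, and the names `applicants` and `tiles`, are my own choices for illustration.

```python
# Applicant counts from the six-department table above: (men, women).
applicants = {
    "A": (825, 108), "B": (560, 25), "C": (325, 593),
    "D": (417, 375), "E": (191, 393), "F": (373, 341),
}

total = sum(men + women for men, women in applicants.values())

tiles = {}
for dept, (men, women) in applicants.items():
    dept_total = men + women
    tiles[dept] = {
        "width": dept_total / total,          # department's share of all applicants
        "men_height": men / dept_total,       # share of men within the department
        "women_height": women / dept_total,   # share of women within the department
    }
```

With these counts, the rectangle for women in Department B is very flat (about 4% of that department's applicants), which is exactly the feature described in the text.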

This mosaic also shows the explanation: Selective departments have more female applicants. It’s easy to see since the departments are ordered by selectiveness. Departments A and B let in many applicants, but they’re mostly male. The reverse is true for the rest. This means that the overall female population takes big admittance hits in departments C through F, while lots of males get in via departments A and B.

One of the perils when studying associations between a variable of interest and a set of explanatory variables is overfitting. If we use too many explanatory variables we may explain very well the observed values of the variable of interest but nothing else and so our study will have little predictive value.

Problems also occur when relevant explanatory variables are ignored. It is possible that when one ignores a relevant variable one observes an effect and when the variable is considered the opposite effect is observed. This is called Simpson’s paradox. What we have explored with the Berkeley graduate admissions data is one of the best-known examples of Simpson’s paradox.
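The reversal can be checked with a few lines of arithmetic on the six-department table above. This is only a sketch: the percentages in the table are rounded, so the admitted counts it reconstructs are approximate, and the names `pooled_rate` and `women_ahead` are mine.

```python
# (applied, admission rate) per department, copied from the table above.
men = {"A": (825, 0.62), "B": (560, 0.63), "C": (325, 0.37),
       "D": (417, 0.33), "E": (191, 0.28), "F": (373, 0.06)}
women = {"A": (108, 0.82), "B": (25, 0.68), "C": (593, 0.34),
         "D": (375, 0.35), "E": (393, 0.24), "F": (341, 0.07)}

def pooled_rate(group):
    """Admission rate after pooling all departments (an applicant-weighted mean)."""
    applied = sum(n for n, _ in group.values())
    admitted = sum(n * rate for n, rate in group.values())  # approximate counts
    return admitted / applied

men_pooled = pooled_rate(men)
women_pooled = pooled_rate(women)

# Departments in which women were admitted at a higher rate than men.
women_ahead = [dept for dept in men if women[dept][1] > men[dept][1]]
```

Pooled over these six departments, men are admitted at roughly 45% and women at roughly 30%, yet women have the higher admission rate in four of the six departments (A, B, D and F). The pooled rate is a weighted average, and the weights (where each group applies) differ between men and women.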

3.2 Treemap

Let us visualise the Berkeley admissions data using a treemap.



Like mosaic plots, a treemap visually displays proportions by varying the area of a rectangular shape. In a treemap, you can display hundreds, or thousands, of pieces of information. In a treemap you must arrange your data elements hierarchically using categorical variables, in a meaningful way for the information you want to display. The data you need is:

  • A quantitative variable that has positive values. This will be used to calculate the area of the rectangles in the treemap.
  • One or more categorical variables associated with that quantitative variable allowing to group the data.

In the Berkeley admission data, the quantitative variable is “Number admitted”. The categorical variables are “Department”, “Admission Status”, and “Gender”, with that order of nesting (i.e. Gender within Admission Status within Department).

Note that the nesting or hierarchy is not always unique (you could instead nest department within admission status, for example). Therefore, you must think about what information you want to display and which nesting is most adequate.

For example, you could have procurement in government departments and each department has individual projects with a cost. There is only one natural hierarchy here, namely project within department.
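The effect of the nesting order can be sketched with plain dictionaries. The records use applicant counts for Departments A and B from the table in section 3. The helpers `nest` and `total` are hypothetical names for illustration, not part of any treemap library.

```python
from collections import defaultdict

# Applicant counts for two departments, taken from the table in section 3.
records = [
    {"department": "A", "gender": "Men", "applicants": 825},
    {"department": "A", "gender": "Women", "applicants": 108},
    {"department": "B", "gender": "Men", "applicants": 560},
    {"department": "B", "gender": "Women", "applicants": 25},
]

def nest(rows, levels, value):
    """Group rows by the given levels, outermost first, summing `value` at the leaves."""
    if not levels:
        return sum(row[value] for row in rows)
    groups = defaultdict(list)
    for row in rows:
        groups[row[levels[0]]].append(row)
    return {key: nest(group, levels[1:], value) for key, group in groups.items()}

def total(node):
    """Total value under a node; a treemap gives each rectangle an area proportional to it."""
    return node if isinstance(node, int) else sum(total(child) for child in node.values())

# The same data under two different hierarchies.
by_department = nest(records, ["department", "gender"], "applicants")
by_gender = nest(records, ["gender", "department"], "applicants")
```

Both hierarchies contain the same leaves and the same grand total, but the top-level rectangles differ: departments in the first case, genders in the second. That is why the nesting should follow the comparison you want the reader to make first.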


For discussion: Would you change anything in the treemap? Which plot do you think better conveys the information about the association between admissions and gender?


4 Case study: other ways of showing time variation - HIV prevalence

We will use data about Adults with HIV in Africa (estimated prevalence of HIV in percentage, ages 15-49) from Gapminder, 1990-2011.

The data consists of yearly HIV prevalence by country as well as income (GDP per capita, PPP$ inflation-adjusted) and population size.

4.1 Scatter plots



For discussion: What would you change in this plot?


Income is a highly skewed variable as many countries have low to medium incomes and very few have very high incomes. Therefore, it is difficult to see the information contained in the scatter plots as the points are cluttered towards low income values.

We will apply a logarithmic (base 10) transformation to income. The logarithm is an increasing function and so the order in the x-axis will be preserved.
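A small sketch shows both properties of the transformation: the order is preserved, and the crowded low-income values get more room on the axis. The income values and the helper `spread_share` are made up for illustration. They are not the Gapminder data.

```python
import math

# Hypothetical incomes (GDP per capita), sorted and deliberately right-skewed.
incomes = [400, 650, 900, 1200, 2500, 8000, 30000]
log_incomes = [math.log10(x) for x in incomes]

# The logarithm is increasing, so the sorted order of the incomes is unchanged.
order_preserved = log_incomes == sorted(log_incomes)

def spread_share(values, cutoff_index):
    """Share of the axis range taken up by the values up to cutoff_index."""
    return (values[cutoff_index] - values[0]) / (values[-1] - values[0])

raw_share = spread_share(incomes, 4)      # five lowest incomes, raw scale
log_share = spread_share(log_incomes, 4)  # same five incomes, log scale
```

On the raw scale the five lowest incomes are squeezed into about 7% of the axis. On the log scale they occupy a little over 40% of it.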



Most African countries have prevalence values on a scale about ten times that of the rest of the world. This makes the visualisation difficult. It’s best to visualise the data for African countries separately.


For discussion: Can you think of any other ways of dealing with highly skewed variables? What would you change in the above plot?


  • Another way to deal with a highly skewed continuous variable is to convert it into a discrete variable by defining suitable intervals within its range. This is not straightforward, and the intervals will certainly vary in length. One thing to take into account is to choose interval lengths so that the number of observations is not excessively high in just a few of them.
  • The scientific notation for income values (x-axis) is not very friendly.
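One possible way to choose the intervals mentioned in the first point is to cut at quantiles, which gives intervals of varying length with similar numbers of observations. The sketch below uses the standard library only. The incomes are made up and quartiles are just one choice among many.

```python
import statistics
from bisect import bisect_right

# Hypothetical right-skewed incomes, not the Gapminder data.
incomes = [300, 400, 450, 500, 600, 700, 900, 1200, 1500, 2500, 6000, 30000]

# Three cut points split the data into four intervals with similar counts.
cuts = statistics.quantiles(incomes, n=4, method="inclusive")

def bin_index(value):
    """Index (0 to 3) of the interval that a value falls into."""
    return bisect_right(cuts, value)

counts = [0, 0, 0, 0]
for income in incomes:
    counts[bin_index(income)] += 1
```

Here each interval receives three observations, even though the last interval is far wider than the first. Equal-width intervals would instead put almost every observation into the first one.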



Let us view the plot for Africa only.


To gain more insight, let us identify the African countries with HIV prevalence greater than or equal to 10%. We add labels that do not overlap.


For discussion: What would you change in this plot?

I will make the background colour white and remove vertical grid lines. I will leave the frame around each plot because in faceted plots it’s good to know each of the plot boundaries.


We can also produce a dynamic plot showing one frame for each year. Follow Equatorial Guinea (gnq) in the bottom right and observe how the country becomes richer and its HIV prevalence increases.



For discussion: What would you change in this plot?


We can add a further variable, population size, with diameter of dots proportional to population size.


For discussion: What would you change in this plot?

Note that the legend title should be kept here: without it, it is not clear why the dots have different sizes.


4.2 Connected scatter plot

Let us follow the evolution of GDP and HIV prevalence in Equatorial Guinea.


A plain scatter plot is misleading because the points should be ordered by Year, not by GDP.


One way to add the time dimension when plotting two time series against each other is to add arrows indicating time evolution and time labels.
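The data preparation behind such a plot can be sketched as follows. The records are invented (they are not the Gapminder values) and the dictionary layout for the arrows is only one possible representation.

```python
# Hypothetical (year, GDP per capita, HIV prevalence %) records, deliberately
# listed out of order; these are not the Gapminder values.
records = [
    (2003, 5800, 3.1),
    (2001, 5200, 2.2),
    (2004, 12000, 3.6),
    (2002, 6100, 2.7),
]

# Order by year, not by GDP, so that the path follows time.
path = sorted(records, key=lambda record: record[0])

# One arrow per pair of consecutive years: start point, end point, year label.
arrows = [
    {"start": (gdp0, hiv0), "end": (gdp1, hiv1), "label": str(year1)}
    for (year0, gdp0, hiv0), (year1, gdp1, hiv1) in zip(path, path[1:])
]
```

In this made-up series GDP dips in 2003, so a line joining the points in GDP order would connect 2001 to 2003 and skip over 2002. Ordering by year keeps the path faithful to what happened.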


GDP and HIV prevalence both increased in Equatorial Guinea until 2005. Note that the arrows only indicate the direction of joint evolution, not a correlation. In particular the arrows will be useful to explore the evolution after 2004.



HIV prevalence has increased steadily, except during 2008. GDP didn’t grow during 2005 and 2008-2010.


4.3 Choropleth maps

Given the geographical nature of the data, it lends itself to display in a choropleth map.

The choropleth map below displays HIV prevalence in Africa in 2010.

4.4 Animated choropleth map

To see the evolution of HIV prevalence over time, we can animate the choropleth map, showing one frame per year.


For discussion: Discuss the merits and suitability of choropleth maps and compare them to the other visualisations discussed in this section.

5 Flow charts, Sankey chart - Research and development funding UK, 2019

The data source is ONS.

5.1 Sankey chart

We need data on at least two categorical variables: one is the source and the other is the target. We also need a quantitative variable giving the amount flowing from source to target.


The thickness of the curves is proportional to the value flowing from node 1 (Funding source) to node 2 (Funding target).

One can have many nodes in a Sankey chart (the above has only 2 nodes).

The bars at each node are usually not displayed in Sankey charts.
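The data behind a Sankey chart can be sketched as a list of (source, target, amount) records. The categories and amounts below are invented for illustration and are not the ONS figures. The variable names are my own.

```python
from collections import defaultdict

# Hypothetical flows (source, target, amount); not the ONS figures.
flows = [
    ("Government", "Universities", 5.0),
    ("Government", "Companies", 2.0),
    ("Business", "Companies", 18.0),
    ("Overseas", "Universities", 1.5),
    ("Government", "Companies", 0.5),  # a repeated pair is merged below
]

# One link per (source, target) pair, summing repeated records.
links = defaultdict(float)
for source, target, amount in flows:
    links[(source, target)] += amount

# Totals at each node, which determine how tall each node is drawn.
source_totals = defaultdict(float)
target_totals = defaultdict(float)
for (source, target), amount in links.items():
    source_totals[source] += amount
    target_totals[target] += amount

# Relative thickness of each curve: its share of everything that flows.
grand_total = sum(links.values())
widths = {pair: amount / grand_total for pair, amount in links.items()}
```

Whatever leaves the sources must arrive at the targets, so the two sets of node totals add up to the same amount. This is a useful sanity check before plotting.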


For discussion: What would you change in this plot? Tell a story from the above Sankey chart.


6 Waterfall chart - Net profit from sales

A waterfall chart illustrates how different quantitative elements contribute to a total. A waterfall chart disaggregates all of the unique components that contribute to a net change visualising them individually.
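The geometry of a waterfall chart follows from a running total: each bar floats between the total before and after its component is applied. Here is a minimal sketch with made-up figures. The labels and amounts are purely illustrative.

```python
# Made-up figures for illustration only.
steps = [
    ("Sales", 1000),
    ("Cost of goods", -400),
    ("Salaries", -250),
    ("Marketing", -100),
    ("Other income", 50),
]

bars = []
running = 0
for label, change in steps:
    start, running = running, running + change
    # Each bar floats between the running total before and after the change.
    bars.append({
        "label": label,
        "bottom": min(start, running),
        "top": max(start, running),
        "rising": change >= 0,
    })

# The final bar shows the net result, drawn from zero.
bars.append({
    "label": "Net profit",
    "bottom": min(0, running),
    "top": max(0, running),
    "rising": running >= 0,
})
```

A plotting library would then draw each bar from `bottom` to `top`, typically colouring rising and falling bars differently.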

6.1 Waterfall chart

In the next example we use some fictitious sales data.



For discussion: How can you use waterfall charts for other applications besides sales?


7 Area and stream charts - COVID-19 effect on patient access to NHS diagnostics

Here we also view how a part contributes to a total.

Data from NHS (up to Nov 2021)

7.1 Area plot


7.2 Streamplot

It’s more striking to view the data as a streamplot. A streamplot is like an area plot, except it is symmetric around zero.
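The difference between the two displays comes down to the baseline on which the bands are stacked. The sketch below assumes the simplest symmetric layout, in which the baseline is minus half of the total at each time point. The category names and counts are hypothetical, not the NHS data.

```python
# Hypothetical counts for three categories over four periods.
series = {
    "Imaging": [40, 30, 20, 35],
    "Endoscopy": [20, 10, 5, 15],
    "Physiology": [10, 10, 5, 10],
}

n_periods = len(next(iter(series.values())))
totals = [sum(values[t] for values in series.values()) for t in range(n_periods)]

def stacked_bands(baseline):
    """Lower and upper edge of each band when stacked on the given baseline."""
    bands, lower = {}, list(baseline)
    for name, values in series.items():
        upper = [low + value for low, value in zip(lower, values)]
        bands[name] = (lower, upper)
        lower = upper
    return bands

# Area plot: bands are stacked upwards from zero.
area_bands = stacked_bands([0] * n_periods)
# Streamplot: the same bands, shifted so the stack is centred on zero.
stream_bands = stacked_bands([-total / 2 for total in totals])
```

The thickness of each band is identical in both versions. Only the position changes, so the streamplot emphasises the overall swelling and shrinking, while the area plot makes the total easier to read off the axis.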


For discussion: Discuss the differences between an area plot and a streamplot. When would you use one or the other?